CLAIMS



WE CLAIM: 



1. A method for classifying electronically posted documents, the method comprising:

receiving a first document and a second document;

generating a first metadata summary corresponding to said first document and a
second metadata summary corresponding to the second document, wherein the first metadata
summary includes a first summary sub-tree and the second metadata summary includes a second
summary sub-tree;

comparing the structure of the first summary sub-tree with the structure of the second
summary sub-tree; and

identifying the first and second documents as distinct if the structures of the first and
second summary sub-trees are not equivalent.
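
By way of illustration only, the structural comparison recited in claim 1 might be sketched in Python as follows. The sketch assumes, hypothetically, that each metadata summary is available as an XML fragment and that "structure" means the nesting of element tags, ignoring attributes and text; the claims fix neither choice. The names `structure_of` and `are_distinct_by_structure` are illustrative and do not come from the claims.

```python
import xml.etree.ElementTree as ET


def structure_of(node):
    """Return a nested tuple of tag names describing the shape of a sub-tree.

    Assumption: structure means tags and their nesting order only;
    attributes and text are deliberately ignored here.
    """
    return (node.tag, tuple(structure_of(child) for child in node))


def are_distinct_by_structure(first_subtree, second_subtree):
    """Identify two documents as distinct when their sub-tree shapes differ."""
    return structure_of(first_subtree) != structure_of(second_subtree)


# Hypothetical summary sub-trees for received documents.
first = ET.fromstring("<summary><title/><body><p/><p/></body></summary>")
second = ET.fromstring("<summary><title/><body><p/></body></summary>")
same_shape = ET.fromstring("<summary><title/><body><p/><p/></body></summary>")
```

Under these assumptions, `first` and `second` are identified as distinct because one `body` holds two `p` elements and the other holds one, while `first` and `same_shape` are not distinguished by structure alone.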

2. The method of claim 1, wherein the first summary sub-tree includes at least one
attribute having a first attribute value, and wherein the second summary sub-tree includes at least one
attribute having a second attribute value, the method further comprising:

comparing, for each of the at least one attributes, the first and second attribute values;
and

identifying the first and second documents as distinct if the attribute values of the first
and second summary sub-trees are not equivalent.

3. The method of claim 1, wherein the first summary sub-tree includes text content, and
wherein the second summary sub-tree includes text content, the method further comprising:

comparing the text content included within the first and second summary sub-trees;
and

identifying the first and second documents as distinct if the text content of the first
and second summary sub-trees are not equivalent.



AM999074 




4. The method of claim 2, wherein the first summary sub-tree further includes text
content, and wherein the second summary sub-tree includes text content, the method further
comprising:

comparing the text content included within the first and second summary sub-trees;
and

identifying the first and second documents as distinct if the text content included
within the first and second summary sub-trees are not equivalent.

5. The method of claim 4, further comprising identifying the first and second documents
as duplicates if the text content within the first and second summary sub-trees are equivalent.
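
By way of illustration only, the successive comparisons of claims 1, 2, 4 and 5 (structure, then attribute values, then text content) might be sketched in Python as follows. The sketch assumes, hypothetically, that summaries are XML fragments, that attribute values are compared element by element in document order, and that text is compared after whitespace normalization; none of these details is stated in the claims. The function names are illustrative.

```python
import xml.etree.ElementTree as ET


def shape(node):
    """Nested tuple of tag names (the assumed meaning of 'structure')."""
    return (node.tag, tuple(shape(child) for child in node))


def attribute_values(node):
    """Attribute dictionaries of every element, in document order."""
    return [dict(element.attrib) for element in node.iter()]


def text_content(node):
    """Whitespace-normalized text pieces, in document order (an assumption)."""
    return [" ".join(piece.split()) for piece in node.itertext() if piece.strip()]


def classify(first, second):
    """Return 'distinct' at the first failed comparison, else 'duplicate'."""
    if shape(first) != shape(second):
        return "distinct"
    if attribute_values(first) != attribute_values(second):
        return "distinct"
    if text_content(first) != text_content(second):
        return "distinct"
    return "duplicate"


# Hypothetical summary sub-trees.
a = ET.fromstring('<summary lang="en"><title>Report</title></summary>')
b = ET.fromstring('<summary lang="en"><title>Report</title></summary>')
c = ET.fromstring('<summary lang="fr"><title>Report</title></summary>')
d = ET.fromstring('<summary lang="en"><title>Memo</title></summary>')
```

Ordering the checks from structure to text lets a pair be identified as distinct as early as possible; only a pair that passes all three comparisons is identified as duplicates.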


6. The method of claim 5, further comprising removing the second metadata summary
from the first summary group if the structures of the first and second summary sub-trees are
equivalent and if the first summary value is equivalent to the second summary value for each of the
at least one attributes.

7. The method of claim 1, further comprising:

defining a first equivalence metadata table comprising:

a first row corresponding to the first metadata summary;

a second row corresponding to the second metadata summary;

a first column corresponding to the first metadata summary; and

a second column corresponding to the second metadata summary, wherein the
process of identifying the first and second documents as distinct if the structures of the first and
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row
and second column position of the equivalence metadata summary.

8. The method of claim 2, further comprising:

defining a first equivalence metadata table comprising:

a first row corresponding to the first metadata summary;

a second row corresponding to the second metadata summary;

a first column corresponding to the first metadata summary; and

a second column corresponding to the second metadata summary, wherein the
process of identifying the first and second documents as distinct if the attribute values of the first and
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row
and second column position of the equivalence metadata summary.
9. The method of claim 3, further comprising:
defining a first equivalence metadata table comprising: 

a first row corresponding to the first metadata summary;

a second row corresponding to the second metadata summary; 

a first column corresponding to the first metadata summary; and 

a second column corresponding to the second metadata summary, wherein the 
process of identifying the first and second documents as distinct if the text content of the first and 
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row 
and second column position of the equivalence metadata summary. 
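
By way of illustration only, the equivalence metadata table of claims 7 to 9 might be sketched in Python as a square grid with one row and one column per metadata summary. The sketch assumes, hypothetically, that every cell starts at 1 and that only the cell at (first row, second column) is written; the claims specify only that a zero binary value is stored at that position when the pair is identified as distinct. The name `build_equivalence_table` and the `is_distinct` predicate are illustrative.

```python
def build_equivalence_table(summaries, is_distinct):
    """Build a table with one row and one column per metadata summary.

    Assumption: cells start at 1 (not yet shown distinct) and are set
    to 0 when the row summary and column summary are found distinct.
    """
    size = len(summaries)
    table = [[1] * size for _ in range(size)]
    for row in range(size):
        for col in range(row + 1, size):
            if is_distinct(summaries[row], summaries[col]):
                # Zero binary value at (first row, second column).
                table[row][col] = 0
    return table


# Hypothetical summaries, compared here by plain string inequality.
summaries = ["<a><b/></a>", "<a><c/></a>", "<a><b/></a>"]
table = build_equivalence_table(summaries, lambda x, y: x != y)
```

In this example the first and third summaries are never marked distinct, so the cell in the first row and third column keeps its initial value.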

10. A method for classifying electronically posted documents, the method comprising: 
receiving a plurality of documents; 

generating a respective plurality of metadata summaries corresponding to the plurality 
of received documents; 

grouping a first subset of the respective plurality of metadata summaries into a first 
summary group, the first summary group comprising summaries having a first mime-type 
designation; 

selecting a first metadata summary and a second metadata summary from the first
summary group, wherein the first metadata summary includes a first summary sub-tree and the
second metadata summary includes a second summary sub-tree; 

comparing the structure of the first summary sub-tree with the structure of the second 
summary sub-tree; and 

identifying the first and second documents as distinct if the structures of the first and
second summary sub-trees are not equivalent. 
11. The method of claim 10, wherein grouping further comprises grouping a second
subset of the respective metadata summaries into a second summary group, the second summary
group comprising summaries having a second mime-type designation.
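
By way of illustration only, the grouping of claims 10 and 11 might be sketched in Python as follows. The sketch assumes, hypothetically, that each metadata summary is a dictionary carrying a `mime_type` field and a `doc_id` field; neither field name appears in the claims.

```python
from collections import defaultdict
from itertools import combinations


def group_by_mime_type(summaries):
    """Place each summary in the summary group for its mime-type designation."""
    groups = defaultdict(list)
    for summary in summaries:
        groups[summary["mime_type"]].append(summary)
    return dict(groups)


def candidate_pairs(group):
    """Select every (first, second) pair of summaries from one summary group."""
    return list(combinations(group, 2))


# Hypothetical metadata summaries.
summaries = [
    {"doc_id": "d1", "mime_type": "text/html"},
    {"doc_id": "d2", "mime_type": "application/pdf"},
    {"doc_id": "d3", "mime_type": "text/html"},
]
groups = group_by_mime_type(summaries)
pairs = candidate_pairs(groups["text/html"])
```

Grouping first means that summaries are only compared with others of the same mime-type designation, which reduces the number of pairs to examine.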






12. A system for classifying electronically posted documents, the system comprising:

a metadata parser module coupled to receive electronically posted documents, the
metadata parser configured to output respective metadata summaries, wherein each respective
metadata summary comprises one or more sub-tree structures, one or more attributes, and content
text;

a summary repository coupled to receive and store the respective metadata
summaries; and

a summary consolidator coupled to the summary repository, the summary
consolidator configured to delete duplicate metadata summaries from the summary repository.

13. The system of claim 12, wherein the summary consolidator comprises:

a sub-tree comparator configured to compare one or more sub-tree structures of the
retrieved metadata summaries;

an attribute comparator configured to compare the attribute values of the retrieved
metadata summaries; and

a text comparator configured to compare the text content included within the retrieved
metadata summaries.

14. The system of claim 13, wherein the sub-tree comparator is configured to compare the
metadata portion of the metadata summary.

15. The system of claim 13, wherein the attribute comparator is configured to compare
the attribute values included within the metadata portion of the metadata summary.

16. The system of claim 13, wherein the text comparator is configured to compare the
text content included within the metadata portion of the metadata summary.
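
By way of illustration only, the arrangement of claims 12 and 13 (a metadata parser, a summary repository, and a summary consolidator) might be sketched in Python as follows. The sketch assumes, hypothetically, that posted documents are XML and that a metadata summary is a tuple of (structure, attribute values, text content), so that tuple equality stands in for the three comparators; the names `parse_metadata`, `SummaryRepository` and `consolidate` are illustrative.

```python
import xml.etree.ElementTree as ET


def parse_metadata(document_xml):
    """Metadata parser: reduce a document to (structure, attributes, text)."""
    root = ET.fromstring(document_xml)

    def shape(node):
        return (node.tag, tuple(shape(child) for child in node))

    attributes = tuple(tuple(sorted(el.attrib.items())) for el in root.iter())
    text = tuple(piece.strip() for piece in root.itertext() if piece.strip())
    return (shape(root), attributes, text)


class SummaryRepository:
    """Summary repository: receives and stores metadata summaries."""

    def __init__(self):
        self.summaries = []

    def store(self, summary):
        self.summaries.append(summary)


def consolidate(repository):
    """Summary consolidator: delete summaries equal to an earlier one."""
    kept = []
    for summary in repository.summaries:
        if summary not in kept:
            kept.append(summary)
    repository.summaries = kept


# Hypothetical posted documents; the first two are duplicates.
repo = SummaryRepository()
for doc in ['<doc id="1"><t>Hello</t></doc>',
            '<doc id="1"><t>Hello</t></doc>',
            '<doc id="2"><t>Hello</t></doc>']:
    repo.store(parse_metadata(doc))
consolidate(repo)
```

After consolidation the repository holds two summaries: the duplicate of the first document is deleted, while the third is retained because its attribute value differs.
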
17. A program product for use in a computer system that executes program steps recorded
in a computer-readable media to perform a method for classifying electronically posted documents,
the program product comprising:

a recordable media;

a program of computer-readable instructions executable by the computer system to
perform processes comprising:

receiving a first document and a second document;

generating a first metadata summary corresponding to said first document and
a second metadata summary corresponding to the second document, wherein the first metadata
summary includes a first summary sub-tree and the second metadata summary includes a second
summary sub-tree;

comparing the structure of the first summary sub-tree with the structure of the
second summary sub-tree; and

identifying the first and second documents as distinct if the structures of the
first and second summary sub-trees are not equivalent.

18. The program product of claim 17, wherein the first summary sub-tree includes at least
one attribute having a first attribute value, and wherein the second summary sub-tree includes at least
one attribute having a second attribute value, the program product method further comprising the
processes of:

comparing, for each of the at least one attributes, the first and second attribute values;
and

identifying the first and second documents as distinct if the attribute values of the first
and second summary sub-trees are not equivalent.

19. The program product of claim 18, wherein the first summary sub-tree includes text
content, and wherein the second summary sub-tree includes text content, the program product further
comprising the processes of:

comparing the text content included within the first and second summary sub-trees;
and

identifying the first and second documents as distinct if the text content of the first
and second summary sub-trees are not equivalent.

20. The program product of claim 19, further comprising the method step of identifying
the first and second documents as duplicates if the text content within the first and second summary
sub-trees are equivalent. 

21. The program product of claim 20, further comprising the process of removing the 
second metadata summary from the first summary group.

22. The program product of claim 21, further comprising the processes of:
defining a first equivalence metadata table comprising: 

a first row corresponding to the first metadata summary; 

a second row corresponding to the second metadata summary; 

a first column corresponding to the first metadata summary; and 

a second column corresponding to the second metadata summary, wherein the 
process of identifying the first and second documents as distinct if the text content of the first and
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row
and second column position of the equivalence metadata summary.

23. The method of claim 18, further comprising the processes of:
defining a first equivalence metadata table comprising: 

a first row corresponding to the first metadata summary;

a second row corresponding to the second metadata summary; 

a first column corresponding to the first metadata summary; and 

a second column corresponding to the second metadata summary, wherein the
process of identifying the first and second documents as distinct if the attribute values of the first and 
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row 
and second column position of the equivalence metadata summary. 



24. The method of claim 19, further comprising the processes of:
defining a first equivalence metadata table comprising:

a first row corresponding to the first metadata summary;

a second row corresponding to the second metadata summary;

a first column corresponding to the first metadata summary; and

a second column corresponding to the second metadata summary, wherein the
process of identifying the first and second documents as distinct if the text content of the first and
second summary sub-trees are not equivalent comprises storing a zero binary value in the first row
and second column position of the equivalence metadata summary.

25. A program product for use in a computer system that executes program steps recorded 
in a computer-readable media to perform a method for classifying electronically posted documents,
the program product comprising: 
a recordable media; 

a program of computer-readable instructions executable by the computer system to
perform method steps comprising: 

receiving a plurality of documents; 

generating a respective plurality of metadata summaries corresponding to the 
plurality of received documents; 

grouping a first subset of the respective plurality of metadata summaries into a
first summary group, the first summary group comprising summaries having a first mime-type
designation; 

selecting a first metadata summary and a second metadata summary from the 
first summary group, wherein the first metadata summary includes a first summary sub-tree and the 
second metadata summary includes a second summary sub-tree;

comparing the structure of the first summary sub-tree with the structure of the 
second summary sub-tree; and 

identifying the first and second documents as distinct if the structures of the
first and second summary sub-trees are not equivalent.



26. The program product of claim 25, wherein the step of grouping further comprises the 
step of grouping a second subset of the respective metadata summaries into a second summary
group, the second summary group comprising summaries having a second mime-type designation. 



